Processing All

Ilyes Baali

3/18/2021

Introduction

High-throughput whole transcriptome RNA sequencing (RNA-seq) technology has enabled a new era of biomedical and clinical research by providing unprecedented high-resolution view of the global transcriptional landscape(Kukurba and Montgomery 2015). For instance, the technology is used routinely to obtain the most comprehensive genome-wide view of the transcriptome, study differential expression patterns [cite many] and allele-specific expression (ASE), analyze splicing and alternative splicing mechanism [cite]. Beyond surveying gene expression levels, RNA-seq can also be applied to investigate genome-wide transcriptional changes, such as chimeric gene fusions, single nucleotide variants, insertions, and deletion events (Z. Wang, Gerstein, and Snyder 2009). Moreover, RNA-seq enabled many novel assays that can be used to to invistigate chromotaine accessibility (e.g ATAC-seq)(Buenrostro et al. 2015), transcriptor factor binding preferences (e.g ChIP-seq) (Park 2009), RNA binding proteins binding sites (e.g CLIP-seq) (Hafner et al. 2021). The capabilities of RNA-seq technology makes it an increasingly adapted technology for the systematic characterization of whole transcriptomes in many organisms and diseases (Marioni et al. 2008).

The consistency and the accuracy of RNA-seq data are critical to ensure that the conclusions of these studies are not flawed. Multiple library preparation steps may introduce potential factors that can confound the experiments, including sample preparation, sequencing platforms, and bioinformatics analysis tools. Many of these factors were studied in depth, but less attention was given to different sample preparation, including RNA isolation, sample handling, library storage time, RNA input level, and sample cryopreservation (L. Wang et al. 2019).

In this project, we investigate the potential consequences of cDNA library storage time and sample cryopreservation on gene level characterization of RNA-seq experiments. Our hypothesis is that RNA-seq data is sensitive to sample preparation factors and that can affect the gene expression profiles of the experiments..

Results

Another brief paragraph summarizing your key insights and possible future experiments/analyses that might enhance your own analysis.

The analysis of the possible effects of library storage time and sample cryopreservation shows that the two factors has minor to no effect on the RNA-seq experiments. Figure 1 and 2 shows PCA plot for studying the impact of library storage time and sample cryopreservation, respectively. We see that the samples from the same patient are closest despite the library storage time and sample cryopreservation. The same observation is seen when looking at the hierarchical clustering of the samples as shown in Figure 3 and 4. For future analysis, we may consider adding more samples to the experiment to get a higher confidence on the results. Also, for the library storage time, it would be better to get samples from different time points to point out the main features of the RNA-seq data that change with time. For sample cryopreservation study, we may consider compare different freezing technologies to make a more informed decision when choosing to store samples for an extended period of time. Other than these two factors that were studied here, we can also look at other library preparation steps like, RNA input concentration to obtain a full picture of possible confounding factors.

Figure 1: PCA plot for library storage time impact

Figure 2: PCA plot for cryopreservation impact

Figure 3: Dendrogram cluster for library storage time impact

Figure 4: Dendrogram cluster for cryopreservation impact

Methods

Dataset

The data that will be used for this project are from GSE110417. the data contains 21 samples, that are either B cells or CD4+ cells isolated from freshly drawn peripheral blood. 13 samples used to investigate the library storage time. The samples are grouped into two: the first group contains samples from 6 patients sequenced with original cDNA library, and the second group contains samples for the same patients but prepared with a cDNA library stored at -80 degrees for three years, with an addition of a sample from a patient prepared with a newly constructed cDNA library. Finally, to investigate the impact of the sample cryopreservation on RNA-seq experiments, fresh and cryopreserved CD4+ samples derived from 4 patients were used, totaling to 8 samples. The sample considered fresh were derived from freshly isolated cells and stored as cell lysates in Qiazol, while the samples considered frozen, peripheral blood mononuclear cells (PBMCs) were isolated from fresh whole blood then frozen in Cosmic calf serum containing 5% DMSO at −80 degrees. The cDNA libraries were prepared using mRNA TruSeq v.2 (Illumina) and Illumina HiSeq 2000 instrument was used to sequence all the samples.

Samples <- read.csv('SraRunTable.txt')
more_details <- read.csv('samples_title.csv')
Samples <- merge(x=Samples, y=more_details, by='Sample_geo_accession')
Samples <- Samples[, c('Sample_title','Run','Cell_type', 'AvgSpotLen', 'Platform', 'LibraryLayout','Instrument', 'LibrarySelection', 'LibrarySource')]
kable(Samples) %>% kable_styling() %>% scroll_box(width = "100%", height = "200px")
Sample_title Run Cell_type AvgSpotLen Platform LibraryLayout Instrument LibrarySelection LibrarySource
concentration_1 rep1 SRR6703961 B cell 100 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
concentration_100 rep1 SRR6703962 B cell 100 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
concentration_250 rep1 SRR6703963 B cell 100 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
concentration_500 rep1 SRR6703964 B cell 100 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
concentration_1 rep2 SRR6703965 B cell 100 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
concentration_100 rep 2 SRR6703966 B cell 100 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
concentration_100 rep2 SRR6703967 B cell 100 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
concentration_500 rep2 SRR6703968 B cell 100 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_101SR SRR6703969 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_101FR SRR6703970 CD4+ T cell 100 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_103SR SRR6703971 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_103FR SRR6703972 CD4+ T cell 50 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_104SR SRR6703973 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_104FR SRR6703974 CD4+ T cell 100 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_107SR SRR6703975 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_107FR SRR6703976 CD4+ T cell 100 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_107SRN SRR6703977 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_109SR SRR6703978 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_109FR SRR6703979 CD4+ T cell 50 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_110SR SRR6703980 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
time_110FR SRR6703981 CD4+ T cell 50 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
frozen_501 SRR6703982 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
frozen_501f SRR6703983 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
frozen_506 SRR6703984 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
frozen_506f SRR6703985 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
frozen_507 SRR6703986 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
frozen_507f SRR6703987 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
frozen_510 SRR6703988 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC
frozen_510f SRR6703989 CD4+ T cell 102 ILLUMINA PAIRED Illumina HiSeq 2000 cDNA TRANSCRIPTOMIC

Download FASTQ files

For my project the data I intend to use is stored in SRA repository. To download the data from SRA, I used sra-toolkit to fetch and extract the fastq files. First I obtained the SRR accession list for all the fastq files for the project. I obtained the list from SRA Run Selector using the GEO accession number: GSE110417 here.

The following command is used to download all the files:

prefetch --option-file SRR_Acc_List.txt

The downloaded files are not a proper .fastq files that can be processed, instead SRA-toolkit downloads .sra (Sequence Read Archive) files. The next step fastq-dump tool (there is a faster tool named fasterq-dump that was not found on the cluster) is use to extract the fastq files from these archive files. The following script is used to extract all the files:

#! /bin/bash -l
#SBATCH --partition=angsd_class #SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=SRATofastq
#SBATCH --time=24:00:00 # HH/MM/SS
#SBATCH --mem=8G # memory requested, units available: K,M,G,T
outFile=${DownloadSRA}_${SLURM_JOB_ID}.txt
echo "Starting at:" `date` >> ${outFile}
echo "This is job #:" $SLURM_JOB_ID >> ${outFile}
echo "Running on node:" `hostname` >> ${outFile}
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> ${outFile}
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> ${outFile}

spack load -r sra-toolkit
for file in *.sra; do
   echo "File Name: " $file >> ${outFile}
   fastq-dump --split-files --gzip $file
done
exit

The parameters that are used in the previous step are --split-files and --gzip. The first one is used to save the pair-end reads to two separate files, and the second is used to save the extracted fastq file to gzip compressed file. The documentation from the tool are displayed below:

--split-files    Dump each read into separate file.Files 
                 will receive suffix corresponding to read 
                 number 
--gzip           Compress output using gzip                  

Data Processing

QC Analysis

First, I took a look at the FastQC analysis of one sample and found that there are were two sequences over-represented, so I decided to run Trimglore on all sequences. Figures 5-9 shows the results of FastQC analysis for all the library storage time samples reported using MultiQC. The samples for cryopreservation experiments shows the same trends as well. Figure 10 shows the length of trimmed sequences after running Trimglore.

Figure 5:

Figure 6:

Figure 7:

Figure 8:

Figure 9:

Figure 10:

Generate Index

Next step was to create the index files for our specie’s genome to align the reads. Since the samples for the project are from human, the human genome hg38 build was downloaded. For the annotation file we have different choices that we can use depending on the use case. Since we are not dealing with special annotation Ensemble gene annotation file hg38.ensGene.gtf.gz was used. The files were download from UCSC Browser. The following script was used to generate the index:

#! /bin/bash -l

#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --job-name=IndexGenerator
#SBATCH --time=12:00:00 # HH/MM/SS
#SBATCH --mem=36G # memory requested, units available: K,M,G,T
outFile=IndexAlign_${SLURM_JOB_ID}.txt
echo "Starting at:" `date` >> ${outFile}
echo "This is job #:" $SLURM_JOB_ID >> ${outFile}
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> ${outFile}
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> ${outFile}

spack load star@2.7.0e


mkdir ${TMPDIR}/ilb4001
tmp=/scratchLocal/ilb4001

echo "Copying files to" ${tmp} >> ${outFile}

# Copy genome and annotation file
cp $SLURM_SUBMIT_DIR/hg38.fa ${SLURM_SUBMIT_DIR}/hg38.ensGene.gtf ${tmp}

# Copy sample read fastq files
cp $SLURM_SUBMIT_DIR/fastq_trimmed/SRR6703962_1_val_1.fq ${SLURM_SUBMIT_DIR}/fastq_trimmed/SRR6703962_2_val_2.fq ${tmp}

# Check that all the files were copied without an error
echo "tmp content..."
ls ${tmp} | cat >> ${outFile}

############################Indexing######################################
resultFolder=${tmp}/hg38_STARindex
mkdir ${resultFolder}

echo "STAR genomeGenerate...." >> ${outFile}

STAR --runMode genomeGenerate \
--runThreadN 16 \
--genomeDir ${resultFolder} \
--genomeFastaFiles hg38.fa \
--sjdbGTFfile hg38.ensGene.gtf \
--sjdbOverhang 49 \
|& tee ${outFile}


echo "copying the generated index...." >> ${outFile}
cp -r ${resultFolder}/* ${SLURM_SUBMIT_DIR}/genome/hg38_STARindex/

Apart from the input parameters, I set the --sjdbOverhang parameter to 49 since the length of the reads is 50 bp.

Alignment

After generating the index for the reference genome, the next step was to align the reads. For that I used STAR alignment tool. Note: the files are pair-end reads. Figure 11 shows the percentages of different aligned reads. We can see that more than 95% of the reads were mapped, and a small percentage of the reads was not mapped. The script for aligning and generating the QC plots is in the full script at the end of the section.

Figure 11: Percentages of STAR aligned reads

To get the distribution of mapped read counts in the different genomic regions, I used the annotation bed file for housekeeping genes hg38.HouseKeepingGenes.bed. The reason for doing that is to speed up the process, as using the annotations for the whole genome was taking way too much time to process. The results of the analysis is discussed in the following sections.

Feature Count

Once the reads were aligned, I used featureCounts tool to generate the feature count for each sample in my dataset. The main parameters I set for the tool are:

  1. -a: the gtf annotation file was set to hg38.ensGene.gtf
  2. -t: the feature level to use for reads counting was set to ‘exon’
  3. -p: Since we have pair-end reads, we count the fragments instead of reads. fragments are specified by the two paired-end reads

The script below was used to perform all the steps, except index generation.

#! /bin/bash -l

#SBATCH --partition=angsd_class
#SBATCH -w buddy.pbtech
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --job-name=ProcessAll
#SBATCH --time=48:00:00 # HH/MM/SS
#SBATCH --mem=36G # memory requested, units available: K,M,G,T
outFile=ProcessAll_${SLURM_JOB_ID}.txt
echo "SLURM_SUBMIT_DIR:" ${SLURM_SUBMIT_DIR}  >> ${outFile}
echo "Starting at:" `date` >> ${outFile}
echo "This is job #:" $SLURM_JOB_ID >> ${outFile}
echo "Running on node:" `hostname` >> ${outFile}
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> ${outFile}
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> ${outFile}

spack load -r fastqc
spack load -r trimgalore
spack load star@2.7.0e
spack load samtools@1.9% gcc@6.3.0
spack load subread
spack load -r py-rseqc@2.6.4

mkdir /scratchLocal/ilb4001
tmp=/scratchLocal/ilb4001
fatqDir=${tmp}/raw_data
mkdir ${fatqDir}

echo "Copying fastq files to" ${fatqDir} >> ${outFile}
cp $SLURM_SUBMIT_DIR/raw_data/SRR67*.fastq.gz ${fatqDir}

mkdir ${tmp}/fastqc
FastQCDir=${tmp}/fastqc

echo "Start fastqc...." >> ${outFile}
fastqc $(ls ${fatqDir}/SRR67*.fastq.gz) --extract -o ${FastQCDir} |$ tee ${outFile}

mkdir ${tmp}/fastq_trimmed
TrimDir=${tmp}/fastq_trimmed
echo "Start TrimGalore" >> ${outFile}
trim_galore --gzip --illumina  --fastqc --fastqc_args "--outdir ${FastQCDir}" --paired  --stringency 5 -o ${TrimDir}  $(ls ${fatqDir}/SRR67*.fastq.gz)


mkdir ${tmp}/genome
GenomeDir=${tmp}/genome
echo "Copying Genome files" >> ${outFile}
cp -r $SLURM_SUBMIT_DIR/genome/hg38_STARindex ${GenomeDir}
cp $SLURM_SUBMIT_DIR/genome/hg38.ensGene.gtf ${GenomeDir}
cp $SLURM_SUBMIT_DIR/genome/hg38.HouseKeepingGenes.bed ${GenomeDir}


AlignmDir=${tmp}/alignments
mkdir ${AlignmDir}
mkdir ${AlignmDir}/bamqc/
for sra in $(cat $SLURM_SUBMIT_DIR/raw_data/SRR_Acc_List.txt); do

     sampleName=$(basename ${TrimDir}/${sra}*_1.fq.gz | cut -d '.' -f 1)
    
     echo "Run STAR on " ${sampleName} >> ${outFile}
     STAR --runMode alignReads \
     --runThreadN 16 \
     --readFilesIn ${TrimDir}/${sra}*_val_1.fq.gz  ${TrimDir}/${sra}*_val_2.fq.gz \
     --readFilesCommand zcat \
     --genomeDir ${GenomeDir}/hg38_STARindex \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix ${AlignmDir}/${sampleName}.
    
     echo "Run samtools index on " ${sampleName} >> ${outFile}
     samtools index ${AlignmDir}/${sampleName}.Aligned.sortedByCoord.out.bam
    
     echo "Run bamqc on " ${sampleName} >> ${outFile}
     /softlib/apps/EL7/BamQC/bin/bamqc ${AlignmDir}/${sampleName}.Aligned.sortedByCoord.out.bam -o ${AlignmDir}/bamqc/
 done


mkdir ${tmp}/RSeqc
QCDir=${tmp}/RSeqc
BED=${GenomeDir}/hg38.HouseKeepingGenes.bed
for sra in $(cat $SLURM_SUBMIT_DIR/raw_data/SRR_Acc_List.txt); do
        
        # Create a directory to save the sample's results
        sampleName=$(basename ${TrimDir}/${sra}*_1.fq.gz | cut -d '.' -f 1)
        if [[ ! -e ${QCDir}/$sampleName ]]; then
                mkdir ${QCDir}/$sampleName
        fi

        # Calculate the read distribution
        read_distribution.py -i ${AlignmDir}/${sra}*.bam \
        -r ${BED} > \
        ${QCDir}/$sampleName/rseqc_read_distribution.out
        echo "read distribution End"  
        # Calculate the gene body coverage
        geneBody_coverage.py -i ${AlignmDir}/${sra}*.bam \
        -r ${BED} \
        -o ${QCDir}/$sampleName/rseqc_geneBody_coverage.out
        echo $sampleName "Done"
done


GTF=${GenomeDir}/hg38.ensGene.gtf
COUNTFILE=${tmp}/geneCounts.txt 
# Run featureCounts
featureCounts -a ${GTF} \
            -o ${COUNTFILE} \
            -t 'exon' \
            -p \
            --verbose \
            ${AlignmDir}/*.bam



echo "coying the results back...." >> ${outFile}
cp -r ${AlignmDir}/* ${SLURM_SUBMIT_DIR}/alignments

mkdir ${SLURM_SUBMIT_DIR}/qc

cp -r ${FastQCDir}/* ${SLURM_SUBMIT_DIR}/qc

cp -r ${TrimDir} ${SLURM_SUBMIT_DIR}

cp -r ${QCDir}  ${SLURM_SUBMIT_DIR}

cp ${tmp}/geneCounts* ${SLURM_SUBMIT_DIR}

echo "remove tmp directory" >> ${outFile}
rm -r ${tmp}
echo "Done!" >> ${outFile}

exit

The impact of the library storage time on RNA-seq experiments

To explore the effect of cDNA library storage time on RNA-seq output, I investigated the differences between two groups of CD4+ blood samples. The first group (orginal group) contains patient samples with original cDNA library, and the second group (3-year group) contains samples with the same cDNA library from the original group stored under -80^o for three years. In addition to that, one patient (ID:107) had a third sample was created by storing the RNA for three years, than used to construct the cDNA library. Figure 12 show an overview of the samples preparation and groups.

Figure 12: Sample Preparation Overview

First, I investigated the potential differences of sequencing biases between the two groups. For that I looked at the percentage of mapped reads across different genomic regions (i.e., intronic, exonic, intergenic and other regions) as shown in the Figure below. we can see that the distribution of the reads across genomic regions has little to no difference between the two groups.

read_dist[c('ID', "cond")]<- t(sapply(read_dist$donor, function(s) substring(s, c(1,4), c(3,6))))
pct_cols <- colnames(read_dist)[grepl("_pct", colnames(read_dist))]
df_read <- read_dist[(read_dist$cond == "FR") | (read_dist$cond == "SR") | (read_dist$cond == "SRN"), c("Sample","factor","donor", "ID", "cond",pct_cols )]
df_read <- gather(df_read, key="region", value="prct", X3_utr_exons_tag_pct:other_intergenic_tag_pct )
df_read$region <- gsub("_tag_pct", "", df_read$region)
old_regions <- unique(df_read$region)
new_regions <- c("3'UTR_Exons", "TES_down_5kb", "TSS_up_5kb","TES_down_1kb","CDS_exons", "5'UTR_exons","TSS_up_10kb","Introns","TSS_up_1kb","TES_down_10kb","Other_Intergenic")
mappings <- setNames(new_regions, old_regions)
args <- c(list(df_read$region), mappings)
df_read$region <- do.call(recode, args)

ggplot(df_read, aes(x=donor, y=prct, fill=region)) +
        scale_fill_brewer( type='div', palette = "RdYlBu") +
         geom_bar(position="stack", stat="identity") + 
        theme(plot.title = element_text(hjust = 0.5, face='bold'),
              legend.position="bottom", 
              axis.text=element_text(size=9, face='bold'),
              axis.title=element_text(size=12,face="bold"), 
              axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+ 
  
  xlab("Sample") + ylab("Percentages") +labs(title="Read Distribution")

ggsave("./figures/Read_distribution_time.png")

We also looked at the gene body coverage to see if there are any pronounce differences. The figure below shows the gene body coverage between the two groups. We see no differences between the two groups.

data <- fromJSON(file = "../project/multiqc/multiqc_data/multiqc_data.json")

gene_body_prc <- data$report_plot_data$rseqc_gene_body_coverage_plot$datasets[[1]]
gene_body_prc <- lapply(gene_body_prc, function(x) x$data)
gene_body_prc <- lapply(gene_body_prc, function(x) sapply(x, function(s) s[2]))

df <- data.frame(gene_body_prc)
colnames(df) <- read_dist$donor
df$x <- seq(1:100)
df <- gather(df, key = "Sample", value = "gene_body_prc", -x)
df[c('ID', "cond")]<- t(sapply(df$Sample, function(s) substring(s, c(1,4), c(3,6))))

df_time <-  df[(df$cond == "FR") | (df$cond == "SR") | (df$cond == "SRN"), ]
     
p1 <- ggplot(df_time, aes(x = x, y = gene_body_prc, group=Sample, color=cond)) + 
  geom_line()+ 
  theme( plot.title = element_text(hjust = 0.5, face='bold'),
         axis.text=element_text(size=9, face='bold'),
        axis.title=element_text(size=12,face="bold"), 
        axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+ 
  xlab("Gene Body Percentile (5' -> 3')") + ylab("%Coverage") +labs(title="Gene Body Coverage")+
  scale_colour_brewer( type='div', palette = "Set1")

p1 

#ggsave('Gene_Body_Coverage_time.png')

Gene Level Charachterization

In this section we look at the gene expression abundance and differentially expressed genes between the two groups. First, we used CPM normalized count to compare the gene counts between the samples of the two groups. The density distribution of gene expression values was highly similar between all the samples. The expression of the genes that are either protein coding or lincRNA was concordant between all the samples as shown in the figures below.

folder <- "./"
readcounts <- paste0(folder,"geneCounts_v2.txt") %>% read.table(., header=TRUE) 
readcounts <- readcounts[, -c(7)]
keep_cols <- colnames(readcounts)[grepl("time", colnames(readcounts))]
readcounts <- readcounts[, c(names(readcounts)[1:6], keep_cols)]
sample_names <- unlist(lapply(strsplit(colnames(readcounts)[7:19], "[.]"), function(x) x[5]))
sample_names <- unlist(lapply(strsplit(sample_names, "_"), function(x) paste(x[2:(length(x)-3)], collapse="_")))
names(readcounts) <- c(names(readcounts)[1:6], sample_names)
row.names(readcounts) <- make.names(readcounts$Geneid)
readcounts <- readcounts[ , -c(1:6)]
cpm <- data.frame(assay(DESeq.ds, "CPM") )
colnames(cpm) <- colnames(DESeq.ds)
cpm$GENEID <- rownames(cpm)
df_cpm <- cpm %>%
  gather(colnames(cpm), key = "Sample", value = "CPM", -GENEID)
df_cpm[c('ID', "cond")]<- t(sapply(df_cpm$Sample, function(s) substring(s, c(1,4), c(3,6))))
ggplot(df_cpm, aes(x=CPM,group=Sample, color=cond )) + 
  geom_density() +
  scale_color_brewer( type='div', palette = "Set1") +
  geom_rug(sides="b") + 
  scale_x_log10()+
  theme(plot.title = element_text(hjust = 0.5, face='bold'))+
  labs(title='Density distribution of the gene CPM value of all samples', caption = "FR:original libraries; SR:3-year stored libraries")+
  ylab('Density')

#ggsave("Density_distribution_time.png")
p1 <- ggplot(df_cpm, aes(y=log10(CPM), x=Sample, fill=cond )) +
  geom_boxplot() +
  scale_fill_brewer( type='div', palette = "Set1") +
  theme(plot.title = element_text(hjust = 0.5, face='bold'),
        axis.text=element_text(size=9, face='bold'),
        axis.title=element_text(size=12,face="bold"), 
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title="Distribution of the gene CPM value of all samples")

df_cpm2 <- merge(x= df_cpm, y= gene_details, by = "GENEID", all.x = TRUE)

p2 <- ggplot(df_cpm2[!is.na(df_cpm2$TXBIOTYPE), ], aes(y=log10(CPM), x=TXBIOTYPE, fill=Sample)) +
  geom_boxplot() +
  scale_fill_brewer( type='div', palette = "RdYlBu") +
  theme(plot.title = element_text(hjust = 0.5, face='bold'),
        axis.text=element_text(size=11, face='bold'),
        axis.title=element_text(size=9,face="bold"), 
        axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+
  xlab(label = "") + labs(title= "Comparison of the CPM values between\n lincRNAs and protein-coding genes ", caption = "FR:original libraries; SR:3-year stored libraries")

(p1 | p2)  + plot_annotation(tag_levels = 'A')

#ggsave('CPM_Boxplots.png')

Furthermore, three RNA libraries of donor ID:107 (i.e., 107FR for the original library, 107SR for the 3-year library, and 107SRN for the new constructed library from the same RNA after three years), exhibited highly correlated expression patterns as shown in the following figure.

df <- DESeq.ds[, c("107SR","107FR")] %>% assay(., "CPM") %>% data.frame
colnames(df) <-  c("107SR","107FR")

p1 <- ggplot(df, aes(x=`107SR`, y=`107FR`)) + geom_point(size=3)+
  stat_smooth(method="lm", se=FALSE, formula = y ~ x) +
  stat_fit_glance( aes(label = sprintf('r^2~"="~%.3f~~italic(P)~"="~%.2g', stat(r.squared), stat(p.value))), parse = TRUE, size = 5)

df <- DESeq.ds[, c("107SR","107SRN")] %>% assay(., "CPM") %>% data.frame
colnames(df) <-  c("107SR","107SRN")

p2 <- ggplot(df, aes(x=`107SR`, y=`107SRN`)) + geom_point(size=3)+
  stat_smooth(method="lm", se=FALSE, formula = y ~ x) +
  stat_fit_glance( aes(label = sprintf('r^2~"="~%.3f~~italic(P)~"="~%.2g', stat(r.squared), stat(p.value))), parse = TRUE, size = 5)


df <- DESeq.ds[, c("107FR","107SRN")] %>% assay(., "CPM") %>% data.frame
colnames(df) <-  c("107FR","107SRN")

p3 <- ggplot(df, aes(x=`107FR`, y=`107SRN`)) + geom_point(size=3)+
  stat_smooth(method="lm", se=FALSE, formula = y ~ x) +
  stat_fit_glance( aes(label = sprintf('r^2~"="~%.3f~~italic(P)~"="~%.2g', stat(r.squared), stat(p.value))), parse = TRUE, size = 5)

(p1 | p2 | p3) + plot_annotation(tag_levels = 'A')

#ggsave("corr_107.png")

Next, we looked at the PCA clustering of the samples to see if they cluster based on the condition (i.e., original library, stored library) or the patient/donor. The PCA plot suggests that the two RNA-seq samples from the same patient are closest despite the difference of their storage conditions. The same thing is shown by looking at the pearson correlation matrix between the samples and the hierarchical clustering as shown in the figures below. For the three plots I used rlog transformed counts since we are comparing the gene expression profiles between samples.

# For PCA and DE analysis the rlog normalized counts are used.

DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)

rv <- rowVars(assay(DESeq.rlog))

# Obtain the indecies of the top 500 variable genes
top_variable <- order(rv, decreasing = TRUE)[seq_len(500)]

# Compute the PCAs based on the rlog normalized gene expression
pca <- prcomp(t(assay(DESeq.rlog)[top_variable, ]))

# data.frame contains informtion for each sample. it will be used for PCA plot
data <- data.frame(colData(DESeq.ds))
# Rename columns 
colnames(data) <- c("factor","donor", "ID", "Condition" )

# Plot the two top PCs
p1 <- autoplot(pca, data=data, colour = 'Condition', label = TRUE, label.size = 4, shape = FALSE) +
  scale_colour_brewer( type='qual', palette = "Set1") +
  theme_classic()+
  theme(legend.position = "none", 
        plot.title = element_text(hjust = 0.5, face='bold')) + 
  labs(title="PCA plot for Library Storage Time Imapct Samples", caption = "FR:original libraries; SR:3-year libraries")

#ggsave('PCA_time.png')
p1

sample_df <- as.data.frame(colData(DESeq.ds)[c('ID','cond')])
colnames(sample_df) <- c('Donor ID', 'Condition')
ann_colors = list(Condition = c(FR = "#8DA0CB", SR = "#FC8D62", SRN="#66C2A5"))

corr_coeff <- cor(assay(DESeq.rlog), method = "pearson") 

p1 <- as.dist(1-corr_coeff, upper = TRUE) %>% 
  as.matrix %>%
pheatmap::pheatmap(., main = "Pearson Distance", annotation_row = sample_df, annotation_colors = ann_colors)

# Run pvclust 
d.pv <- as.dendrogram(pvclust(corr_coeff, nboot=10, quiet=T)$hclust)
ddata <- dendro_data(d.pv, type = "rectangle")
# Get data frames to plot
df_seg <- segment(ddata)
df_labs <- data.frame(label(ddata), ID = as.factor(sample_df[match(label(ddata)$label, rownames(sample_df)), "Donor ID"]))

# Create ggplot dendrogram
p <- ggplot()+ 
  geom_segment(data = df_seg,
               aes(x = x, y = y, xend = xend, yend = yend),
               size = 1.25,
               colour = "darkgray", lineend = "round") +
  geom_text(data = df_labs, aes(x = x, y = y, label = label, colour = ID), 
            nudge_y = 0,
            family = "serif", size = 4, angle = 90, hjust = 1, fontface='bold') + 
  scale_colour_brewer( type='div', palette = "Dark2") +
  xlab("") + ylab("Height")
p <- p + theme(axis.line.x = element_blank(),
               axis.text.x = element_blank(),
               axis.ticks.x = element_blank(),
               text = element_text(family = "serif")) +
  # theme_gray()+
  scale_y_continuous(expand = expand_scale(add = c(0.25, 0)))+
  theme_classic() + 
  theme(axis.line.x = element_blank(), axis.ticks.x =element_blank(), 
        axis.text.x= element_blank(),legend.position = "none", 
        plot.title = element_text(hjust = 0.5, face='bold')) +
  labs(title = "Cluster Dendrogram")

#ggsave('dendrogram_time.png')
p

Lastly, we looked at the differential gene expression between the original and 3-years group. We found 15 differentially expressed genes between groups. This suggest that cDNA library storage time has not introduced any biases in the samples that may effect the analysis when comparing the samples from different patient. A summary of the DE analysis is shown in the table below.

DE_ds <- DESeq.ds[, DESeq.ds$cond != "SRN" ]
colData(DE_ds)$cond <- factor(as.character(colData(DESeq.ds[, DESeq.ds$cond != "SRN" ])$cond), levels=c("FR", "SR"))

res <- DESeq(DE_ds)
res_df <- as.data.frame(results(res))
res_df$SYMBOL <- gene_symbol[match(rownames(res_df), names(gene_symbol))]
res_df <- res_df[(res_df$padj < 0.05)  & (!is.na(res_df$padj)), ]
res_df <- res_df[order(res_df$padj),]
kable(res_df) %>% kable_styling() %>% scroll_box(width = "100%", height = "200px")
baseMean log2FoldChange lfcSE stat pvalue padj SYMBOL
ENSG00000213399 34.560165 5.219482 0.5226264 9.987022 0.00e+00 0.0000000 AC022210.2
ENSG00000229117 6795.090428 1.215919 0.1274798 9.538131 0.00e+00 0.0000000 RPL41
ENSG00000212664 35.855870 -2.475455 0.3218213 -7.692018 0.00e+00 0.0000000 RP11-592N21.1
ENSG00000273136 249.350865 1.871904 0.2539311 7.371701 0.00e+00 0.0000000 NBPF26
ENSG00000229994 18.202757 2.644336 0.3775303 7.004303 0.00e+00 0.0000000 RPL5P4
ENSG00000225573 10.533188 3.697417 0.6551633 5.643504 0.00e+00 0.0000574 RPL35P5
ENSG00000109471 37.953389 -3.297028 0.6056449 -5.443831 1.00e-07 0.0001541 IL2
ENSG00000111537 34.632150 -1.348320 0.2538339 -5.311820 1.00e-07 0.0002245 IFNG
ENSG00000223551 24.775425 1.803860 0.3378907 5.338590 1.00e-07 0.0002245 TMSB4XP4
ENSG00000227008 44.714744 1.079535 0.2032282 5.311933 1.00e-07 0.0002245 RP3-417G15.1
ENSG00000218426 157.384940 -1.008331 0.2224115 -4.533628 5.80e-06 0.0109001 RP11-475C16.1
ENSG00000164400 5.651494 -3.568857 0.8237855 -4.332265 1.48e-05 0.0254335 CSF2
ENSG00000225131 9.828745 1.868007 0.4333528 4.310593 1.63e-05 0.0259005 PSME2P2
ENSG00000229204 4.675141 -3.583913 0.8463938 -4.234333 2.29e-05 0.0338607 PTGES3P3
ENSG00000238000 7.993616 2.175253 0.5169151 4.208144 2.57e-05 0.0354974 RP11-274E7.2

The impact of the sample cryopreservation on RNA-seq experiments

Cryopreservation usually results in an increased proportion of damaged cells, which could potentially affect the outputs of RNA-seq experiments. To explore the effect of of sample cryopreservation on RNA-seq output, we investigated the differences between two groups of fresh (i.e. unfrozen group) and cryopreserved (i.e. frozen group) CD4+ blood samples derived from 4 healthy patients, where each subject has both fresh and cryopreserved sample. Figure 13 show an overview of the samples preparation and groups.

Figure 13: Sample Preparation Overview

Looking at the percentage of mapped reads across different genomic regions shows no effect of sample cryopreservation on the RNA-seq experiment.

df_read <- read_dist[(read_dist$cond == "f") | (read_dist$cond == ""), c("Sample","factor","donor", "ID", "cond",pct_cols )]

df_read <- gather(df_read, key="region", value="prct", X3_utr_exons_tag_pct:other_intergenic_tag_pct )
df_read$region <- gsub("_tag_pct", "", df_read$region)
old_regions <- unique(df_read$region)
new_regions <- c("3'UTR_Exons", "TES_down_5kb", "TSS_up_5kb","TES_down_1kb","CDS_exons", "5'UTR_exons","TSS_up_10kb","Introns","TSS_up_1kb","TES_down_10kb","Other_Intergenic")
mappings <- setNames(new_regions, old_regions)
args <- c(list(df_read$region), mappings)
df_read$region <- do.call(recode, args)

ggplot(df_read, aes(x=donor, y=prct, fill=region)) +
        scale_fill_brewer( type='div', palette = "RdYlBu") +
         geom_bar(position="stack", stat="identity") + 
        theme(plot.title = element_text(hjust = 0.5, face='bold'),
              legend.position="bottom", axis.text=element_text(size=10, face='bold'),
              axis.title=element_text(hjust=0.5, size=12,face="bold"), 
              axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+ 
  xlab("Sample") + ylab("Percentages") +labs(title="Read Distribution")

The gene body coverage shows a bias towards the 5’ end for fresh/unfrozen samples, while the frozen samples show a bias towards the 3’ end. But overall the gene body coverage follows the same trend across all the samples.

library("rjson")
data <- fromJSON(file = "../project/multiqc/multiqc_data/multiqc_data.json")

gene_body_prc <- data$report_plot_data$rseqc_gene_body_coverage_plot$datasets[[1]]
gene_body_prc <- lapply(gene_body_prc, function(x) x$data)
gene_body_prc <- lapply(gene_body_prc,function(x) sapply(x, function(s) s[2]))

df <- data.frame(gene_body_prc)
colnames(df) <- read_dist$donor
df$x <- seq(1:100)
df <- gather(df, key = "Sample", value = "gene_body_prc", -x)
df[c('ID', "cond")]<- t(sapply(df$Sample, function(s) substring(s, c(1,4), c(3,6))))

df_frozen <-  df[(df$cond == "f") | (df$cond == "") , ]
df_frozen <- df_frozen %>%
     mutate(cond=replace(cond, cond=="f", "Frozen")) %>%
     mutate(cond=replace(cond, cond=="", "Unfrozen"))

p1 <- ggplot(df_frozen, aes(x = x, y = gene_body_prc, group=Sample, color=cond)) + 
  geom_line()+ 
  theme( plot.title = element_text(hjust = 0.5, face='bold'),
         axis.text=element_text(size=9, face='bold'),
         axis.title=element_text(size=12,face="bold"), 
         axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+ 
  xlab("Gene Body Percentile (5' -> 3')") + ylab("%Coverage") +labs(title="Gene Body Coverage")+
  scale_colour_brewer( type='div', palette = "Set1")

p1

Gene Level Charachterization

The density distribution of gene expression values was highly concordant between all the samples. The expression of the genes that are either protein coding or lincRNA was similar between all the samples as shown in the figures below.

cpm <- data.frame(assay(DESeq.ds, "CPM") )
colnames(cpm) <- colnames(DESeq.ds)
cpm$GENEID <- rownames(cpm)
df_cpm <- cpm %>%
  gather(colnames(cpm), key = "Sample", value = "CPM", -GENEID)
df_cpm[c('ID', "cond")]<- t(sapply(df_cpm$Sample, function(s) substring(s, c(1,4), c(3,6))))
df_cpm <- df_cpm %>%
     mutate(cond=replace(cond, cond=="f", "Frozen")) %>%
     mutate(cond=replace(cond, cond=="", "Unfrozen"))

ggplot(df_cpm, aes(x=CPM,group=Sample, color=cond )) + 
  geom_density() +
  scale_color_brewer( type='div', palette = "Set1") +
  geom_rug(sides="b") + 
  scale_x_log10()+
  theme(plot.title = element_text(hjust = 0.5, face='bold'))+
  labs(title='Density distribution of the gene CPM value of all samples', caption = "Frozen:cryopreserved samples, Unfrozen: fresh samples")+
  ylab('Density')

p1 <- ggplot(df_cpm, aes(y=log10(CPM), x=Sample, fill=cond )) +
  geom_boxplot() +
  scale_fill_brewer( type='div', palette = "Set1") +
  theme(axis.text=element_text(size=9, face='bold'),
        axis.title=element_text(size=12,face="bold"), 
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title="Distribution of the gene CPM value of all samples")

df_cpm2 <- merge(x= df_cpm, y= gene_details, by = "GENEID", all.x = TRUE)

p2 <- ggplot(df_cpm2[!is.na(df_cpm2$TXBIOTYPE), ], aes(y=log10(CPM), x=TXBIOTYPE, fill=Sample)) +
  geom_boxplot() +
  scale_fill_brewer( type='div', palette = "RdYlBu") +
  theme(plot.title = element_text(hjust = 0.5, face='bold'),
        axis.text=element_text(size=11, face='bold'),
        axis.title=element_text(size=12,face="bold"), 
        axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))+
  xlab(label = "") + labs(title= "Comparison of the CPM values between\nlincRNAs and protein-coding genes ", caption = "Frozen:cryopreserved samples, Unfrozen: fresh samples")

(p1 | p2)  + plot_annotation(tag_levels = 'A')

The PCA plot suggest that the two RNA-seq samples from same individuals were more similar than those under same cryopreserved conditions. The same thing is shown by looking at the pearson correlation matrix between the samples and performing hierarchical clustering, except for the fresh sample from donor 501 that was clustered away from its frozen sample as shown in the figures below.

# For PCA analysis the rlog normalized counts are used.

DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)

rv <- rowVars(assay(DESeq.rlog))

# Obtain the indecies of the top 500 variable genes
top_variable <- order(rv, decreasing = TRUE)[seq_len(500)]

# Compute the PCAs based on the rlog normalized gene expression
pca <- prcomp(t(assay(DESeq.rlog)[top_variable, ]))

# data.frame contains informtion for each sample. it will be used for PCA plot
data <- data.frame(colData(DESeq.ds))
# rename columns 
colnames(data) <- c("factor","donor", "ID", "Condition" )

# Plot the two top PCs
p1 <- autoplot(pca, data=data, colour = 'Condition', label = TRUE, label.size = 4, shape = FALSE) +
  scale_colour_brewer( type='qual', palette = "Set1") +
  theme_classic()+
  theme(plot.title = element_text(hjust = 0.5, face='bold'),
        legend.position="bottom") + 
  labs(title="PCA plot for Cryopreservation Imapct Samples", caption = "Frozen:cryopreserved samples, Unfrozen: fresh samples")

p1

sample_df <- as.data.frame(colData(DESeq.ds)[c('ID','cond')])
colnames(sample_df) <- c('Donor ID', 'Condition')
ann_colors = list(Condition = c(Frozen = "#8DA0CB", Unfrozen = "#FC8D62"))

corr_coeff <- cor(assay(DESeq.rlog), method = "pearson") 

p1 <- as.dist(1-corr_coeff, upper = TRUE) %>% 
  as.matrix %>%
pheatmap::pheatmap(., main = "Pearson Distance", annotation_row = sample_df, annotation_colors = ann_colors)

# Run pvclust 
d.pv <- as.dendrogram(pvclust(corr_coeff, nboot=10, quiet=T)$hclust)
ddata <- dendro_data(d.pv, type = "rectangle")
# Get data frames to plot
df_seg <- segment(ddata)
df_labs <- data.frame(label(ddata), ID = as.factor(sample_df[match(label(ddata)$label, rownames(sample_df)), "Donor ID"]))

# Create ggplot dendrogram
p <- ggplot()+ 
  geom_segment(data = df_seg,
               aes(x = x, y = y, xend = xend, yend = yend),
               size = 1.25,
               colour = "darkgray", lineend = "round") +
  geom_text(data = df_labs, aes(x = x, y = y, label = label, colour = ID), 
            nudge_y = 0,
            family = "serif", size = 4, angle = 90, hjust = 1, fontface='bold') + 
  scale_colour_brewer( type='div', palette = "Dark2") +
  xlab("") + ylab("Height")
p <- p + theme(axis.line.x = element_blank(),
               axis.text.x = element_blank(),
               axis.ticks.x = element_blank(),
               text = element_text(family = "serif")) +
  # theme_gray()+
  scale_y_continuous(expand = expand_scale(add = c(0.25, 0.5)))+ 
  theme_classic() + 
  theme(axis.line.x = element_blank(), axis.ticks.x =element_blank(), 
        axis.text.x= element_blank(),legend.position = "none", 
        plot.title = element_text(hjust = 0.5, face='bold')) +
  labs(title = "Cluster Dendrogram")

p

Lastly, we looked at the differential gene expression between the fresh and frozen group. We found 71 differentially expressed genes between groups. This suggest that sample cryopreservation has introduced some biases in the samples. This requires a look at the genes that were differentially expressed and see if they are random or they share a similar characteristics. A summary of the DE analysis is shown in the table below.

colData(DESeq.ds)$cond <- factor(as.character(colData(DESeq.ds)$cond), levels=c("Unfrozen", "Frozen"))

res <- DESeq(DESeq.ds)
res_df <- as.data.frame(results(res))
res_df$SYMBOL <- gene_symbol[match(rownames(res_df), names(gene_symbol))]
res_df <- res_df[(res_df$padj < 0.05) & (!is.na(res_df$padj)), ]
res_df <- res_df[order(res_df$padj),]
kable(res_df) %>% kable_styling() %>% scroll_box(width = "100%", height = "200px")
baseMean log2FoldChange lfcSE stat pvalue padj SYMBOL
ENSG00000247982 122.9535 2.0932760 0.3401249 6.154433 0.0000000 0.0000062 LINC00926
ENSG00000210140 577.2135 1.5977724 0.2633113 6.067999 0.0000000 0.0000062 MT-TC
ENSG00000125740 496.4931 3.6827282 0.6200794 5.939123 0.0000000 0.0000092 FOSB
ENSG00000210144 274.2191 1.7720618 0.3112996 5.692464 0.0000000 0.0000302 MT-TY
ENSG00000090104 263.8083 2.2489824 0.4162041 5.403556 0.0000001 0.0001261 RGS1
ENSG00000198695 20494.6069 1.5550274 0.2982592 5.213678 0.0000002 0.0002977 MT-ND6
ENSG00000071537 1523.1205 -0.3828004 0.0754102 -5.076243 0.0000004 0.0005306 SEL1L
ENSG00000163534 139.1476 2.5208469 0.5048343 4.993415 0.0000006 0.0006359 FCRL1
ENSG00000101347 9278.7095 -0.3826888 0.0766123 -4.995137 0.0000006 0.0006359 SAMHD1
ENSG00000248527 3981.3000 1.6383975 0.3317795 4.938212 0.0000008 0.0007607 MTATP6P1
ENSG00000177606 1461.6758 3.4222802 0.7126031 4.802505 0.0000016 0.0013743 JUN
ENSG00000140526 1877.2060 -0.4907780 0.1026225 -4.782361 0.0000017 0.0013929 ABHD2
ENSG00000225630 2106.6768 1.5948404 0.3399511 4.691382 0.0000027 0.0016878 MTND2P28
ENSG00000237973 267.2047 1.7148056 0.3660168 4.685046 0.0000028 0.0016878 MTCO1P12
ENSG00000136485 1902.1383 -0.5928479 0.1254639 -4.725246 0.0000023 0.0016878 DCAF7
ENSG00000198840 15925.4121 1.4773019 0.3143308 4.699831 0.0000026 0.0016878 MT-ND3
ENSG00000007312 310.0087 0.6064963 0.1303061 4.654398 0.0000032 0.0018035 CD79B
ENSG00000137409 2758.0218 0.3615614 0.0778020 4.647201 0.0000034 0.0018035 MTCH1
ENSG00000156738 343.4310 3.1099976 0.6724906 4.624596 0.0000038 0.0019059 MS4A1
ENSG00000177485 815.4753 -0.4990550 0.1100175 -4.536143 0.0000057 0.0027638 ZBTB33
ENSG00000228253 11130.3689 1.6020981 0.3542848 4.522062 0.0000061 0.0028135 MT-ATP8
ENSG00000258441 946.2335 0.5066317 0.1143183 4.431766 0.0000093 0.0040988 LINC00641
ENSG00000231721 548.1144 0.6608080 0.1531602 4.314489 0.0000160 0.0067105 LINC-PINT
ENSG00000141367 2691.6656 -0.4902580 0.1141841 -4.293577 0.0000176 0.0070679 CLTC
ENSG00000102901 1273.9638 0.3100813 0.0727763 4.260743 0.0000204 0.0078631 CENPT
ENSG00000155287 764.7964 0.3912217 0.0930692 4.203557 0.0000263 0.0090537 SLC25A28
ENSG00000089327 4073.0997 0.3750953 0.0891546 4.207246 0.0000259 0.0090537 FXYD5
ENSG00000005893 1168.7492 -0.4178340 0.0993964 -4.203715 0.0000263 0.0090537 LAMP2
ENSG00000178188 2098.8204 0.2853805 0.0711636 4.010200 0.0000607 0.0189172 SH2B1
ENSG00000158552 693.5175 0.3726694 0.0928264 4.014693 0.0000595 0.0189172 ZFAND2B
ENSG00000210194 150.7174 1.3518971 0.3371524 4.009750 0.0000608 0.0189172 MT-TE
ENSG00000253729 4955.7743 -0.4088103 0.1021872 -4.000601 0.0000632 0.0190494 PRKDC
ENSG00000153201 3482.5508 -0.4618723 0.1157512 -3.990217 0.0000660 0.0192997 RANBP2
ENSG00000185009 856.5723 -0.4275459 0.1085783 -3.937673 0.0000823 0.0223709 AP3M1
ENSG00000172890 2670.8198 0.2737857 0.0695198 3.938241 0.0000821 0.0223709 NADSYN1
ENSG00000171302 611.6765 -0.4030755 0.1025288 -3.931340 0.0000845 0.0223709 CANT1
ENSG00000153827 4277.3621 -0.2958935 0.0753367 -3.927615 0.0000858 0.0223709 TRIP12
ENSG00000057252 983.4377 -0.2986947 0.0761829 -3.920757 0.0000883 0.0224116 SOAT1
ENSG00000131876 1055.2962 0.3496586 0.0894054 3.910935 0.0000919 0.0227445 SNRPA1
ENSG00000115307 1756.8106 0.3464496 0.0891941 3.884220 0.0001027 0.0243501 AUP1
ENSG00000105887 2191.7146 -0.3742273 0.0963934 -3.882289 0.0001035 0.0243501 MTPN
ENSG00000175216 1204.2976 -0.4545594 0.1174169 -3.871327 0.0001082 0.0246772 CKAP5
ENSG00000135164 3461.8329 0.3078830 0.0796090 3.867441 0.0001100 0.0246772 DMTF1
ENSG00000165934 1086.4042 -0.4350308 0.1133018 -3.839575 0.0001232 0.0270248 CPSF2
ENSG00000215271 214.6026 -0.4803380 0.1259479 -3.813785 0.0001369 0.0287038 HOMEZ
ENSG00000141027 3631.4037 -0.3838474 0.1005618 -3.817031 0.0001351 0.0287038 NCOR1
ENSG00000125107 5173.1848 -0.4073286 0.1070721 -3.804245 0.0001422 0.0291980 CNOT1
ENSG00000172315 384.5109 -0.5335844 0.1411938 -3.779091 0.0001574 0.0316377 TP53RK
ENSG00000197102 8720.0053 -0.3150027 0.0836305 -3.766601 0.0001655 0.0325836 DYNC1H1
ENSG00000148341 1528.0802 0.2600918 0.0693389 3.751022 0.0001761 0.0339831 SH3GLB2
ENSG00000085719 1109.1622 -0.3666108 0.0978709 -3.745863 0.0001798 0.0340092 CPNE3
ENSG00000104957 898.1565 0.3283645 0.0879123 3.735139 0.0001876 0.0348092 CCDC130
ENSG00000111674 1424.9214 0.3238274 0.0879695 3.681134 0.0002322 0.0407319 ENO2
ENSG00000114978 2929.4948 -0.3116295 0.0845244 -3.686857 0.0002270 0.0407319 MOB1A
ENSG00000144426 350.9537 -0.4413756 0.1198982 -3.681251 0.0002321 0.0407319 NBEAL1
ENSG00000279088 306.0024 0.4482117 0.1228121 3.649572 0.0002627 0.0423861 RP11-574K11.24
ENSG00000185591 2680.0752 -0.3623868 0.0988498 -3.666033 0.0002463 0.0423861 SP1
ENSG00000100596 1231.8482 -0.5216516 0.1426137 -3.657795 0.0002544 0.0423861 SPTLC2
ENSG00000140983 2464.9095 0.3535820 0.0967378 3.655054 0.0002571 0.0423861 RHOT2
ENSG00000108582 981.6010 -0.3383162 0.0928311 -3.644427 0.0002680 0.0423861 CPD
ENSG00000087074 854.3845 0.9810162 0.2691466 3.644914 0.0002675 0.0423861 PPP1R15A
ENSG00000114331 2209.8550 -0.2915671 0.0801396 -3.638240 0.0002745 0.0427169 ACAP2
ENSG00000198712 60915.3453 1.7008518 0.4681201 3.633367 0.0002797 0.0428413 MT-CO2
ENSG00000137145 1650.1165 -0.3922689 0.1083559 -3.620191 0.0002944 0.0436960 DENND4C
ENSG00000198763 43155.6931 1.6904637 0.4667793 3.621548 0.0002928 0.0436960 MT-ND2
ENSG00000171163 779.5569 0.3860699 0.1073026 3.597955 0.0003207 0.0468847 ZNF692
ENSG00000196405 13775.8606 0.2408411 0.0670289 3.593095 0.0003268 0.0470553 EVL
ENSG00000080824 7234.8730 -0.3266137 0.0910561 -3.586951 0.0003346 0.0474691 HSP90AA1
ENSG00000134453 2216.7016 0.2388790 0.0668905 3.571193 0.0003554 0.0489786 RBM17
ENSG00000119682 1192.2457 -0.2985510 0.0835629 -3.572772 0.0003532 0.0489786 AREL1
ENSG00000108465 3239.9714 0.3933699 0.1103520 3.564684 0.0003643 0.0495030 CDK5RAP3

Discussion

  1. Figure out the right way to compare the samples from different conditions to get the answer for the our question. First, I looked at the read counts and the distribution of the reads from the QC tools, then I didn’t know what else I should inspect. Then I took some pointers from the orginal paper (L. Wang et al. 2019) to complete the analysis.
  2. Normalization: The original paper used CPM normalized counts in some part of the analysis. At first it seemed the wrong thing to do since we are comparing different samples, but I found out we can use CPM only to compare the gene count between samples of the same sample group. But when I get to clustering and DE analysis, the CPM count seemed to be influenced by the conditions not the individual patient sample, but once I used the rlog normalized counts the results suggested otherwise.
  3. Understanding the details of the sample cryopreservation experiment. There are some technical details concerning fresh RNA samples that I’m not confident about, that is how was the cell lysates used to derive fresh RNA-seq stored and how it differs from Cosmic calf serum that was used to freeze the peripheral blood mononuclear cells.

References

Buenrostro, Jason D., Beijing Wu, Howard Y. Chang, and William J. Greenleaf. 2015. ATAC-Seq: A Method for Assaying Chromatin Accessibility Genome-Wide.” Current Protocols in Molecular Biology 109 (1): 21.29.1–9. https://doi.org/https://doi.org/10.1002/0471142727.mb2129s109.
Hafner, Markus, Maria Katsantoni, Tino Köster, James Marks, Joyita Mukherjee, Dorothee Staiger, Jernej Ule, and Mihaela Zavolan. 2021. CLIP and Complementary Methods.” Nature Reviews Methods Primers 1 (1): 1–23. https://doi.org/10.1038/s43586-021-00018-1.
Kukurba, Kimberly R., and Stephen B. Montgomery. 2015. RNA Sequencing and Analysis.” Cold Spring Harbor Protocols 2015 (11): 951–69. https://doi.org/10.1101/pdb.top084970.
Marioni, John C., Christopher E. Mason, Shrikant M. Mane, Matthew Stephens, and Yoav Gilad. 2008. RNA-Seq: An Assessment of Technical Reproducibility and Comparison with Gene Expression Arrays.” Genome Research 18 (9): 1509–17. https://doi.org/10.1101/gr.079558.108.
Park, Peter J. 2009. ChIP–Seq: Advantages and Challenges of a Maturing Technology.” Nature Reviews Genetics 10 (10): 669–80. https://doi.org/10.1038/nrg2641.
Wang, Lei, Sara J. Felts, Virginia P. Van Keulen, Larry R. Pease, and Yuji Zhang. 2019. “Exploring the Effect of Library Preparation on RNA Sequencing Experiments.” Genomics 111 (6): 1752–59. https://doi.org/10.1016/j.ygeno.2018.11.030.
Wang, Zhong, Mark Gerstein, and Michael Snyder. 2009. RNA-Seq: A Revolutionary Tool for Transcriptomics.” Nature Reviews Genetics 10 (1): 57–63. https://doi.org/10.1038/nrg2484.